hpopuri2 commented Jan 27, 2026

Add Valkey Distributed Cache for Horizontal Scaling

Summary

This PR implements distributed caching using Valkey to enable horizontal scaling of Trino Gateway. Multiple gateway instances can now share query metadata through a distributed cache layer, ensuring consistent query routing across all instances.

Motivation

Currently, Trino Gateway uses local Guava caches that are not shared between instances. In multi-instance deployments, this can lead to:

  • Inconsistent query routing when requests hit different gateway instances
  • Cache misses requiring expensive database lookups
  • Inability to leverage cache across horizontally scaled deployments

This implementation addresses these limitations while maintaining backward compatibility and graceful degradation.

Architecture

3-Tier Caching Strategy

Request Flow:

  1. L1 Cache (Local Guava) → ~1ms
    ├─ Hit: Return immediately
    └─ Miss: Check L2
  2. L2 Cache (Valkey Distributed) → ~5ms
    ├─ Hit: Populate L1, return
    └─ Miss: Check L3
  3. L3 Cache (PostgreSQL Database) → ~50ms
    ├─ Found: Populate L2 + L1, return
    └─ Not Found: Search backends via HTTP (~200ms)
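
The request flow above can be sketched in a few lines of Java. This is a minimal illustration, not the code in this PR: `TieredLookup` and `findBackend` are hypothetical names, and plain maps and a function stand in for Guava, Valkey, and PostgreSQL.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of the L1 -> L2 -> L3 lookup order; not the PR's code.
class TieredLookup {
    private final Map<String, String> l1 = new ConcurrentHashMap<>(); // stands in for the local Guava cache
    private final Map<String, String> l2;                             // stands in for Valkey (shared)
    private final Function<String, Optional<String>> database;        // stands in for the PostgreSQL lookup

    TieredLookup(Map<String, String> l2, Function<String, Optional<String>> database) {
        this.l2 = l2;
        this.database = database;
    }

    Optional<String> findBackend(String queryId) {
        String local = l1.get(queryId);
        if (local != null) {
            return Optional.of(local);        // L1 hit: return immediately
        }
        String shared = l2.get(queryId);
        if (shared != null) {
            l1.put(queryId, shared);          // L2 hit: populate L1
            return Optional.of(shared);
        }
        Optional<String> stored = database.apply(queryId);
        stored.ifPresent(url -> {             // L3 hit: populate L2 and L1
            l2.put(queryId, url);
            l1.put(queryId, url);
        });
        return stored;                        // empty: caller falls back to the HTTP backend search
    }
}
```

Two `TieredLookup` instances that share the same `l2` map behave like two gateways behind one Valkey server: a value loaded by one instance is an L2 hit for the other.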

Cache Keys

  • trino:query:backend:{queryId} - Backend URL for query
  • trino:query:routinggroup:{queryId} - Routing group for query
  • trino:query:externalurl:{queryId} - External URL (lazy-loaded)
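
A small helper keeps these key formats in one place. Only the three key patterns come from this PR; the `CacheKeys` class and its method names are assumptions made for illustration.

```java
// Hypothetical helper; only the three key patterns come from the PR description.
final class CacheKeys {
    private static final String PREFIX = "trino:query:";

    private CacheKeys() {}

    static String backend(String queryId) {
        return PREFIX + "backend:" + queryId;
    }

    static String routingGroup(String queryId) {
        return PREFIX + "routinggroup:" + queryId;
    }

    static String externalUrl(String queryId) {
        return PREFIX + "externalurl:" + queryId;
    }
}
```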

Implementation Details

Core Components

ValkeyConfiguration (gateway-ha/src/main/java/io/trino/gateway/ha/config/ValkeyConfiguration.java)

  • 11 configurable parameters with sensible defaults
  • Input validation (port range, positive values, health check intervals)
  • Convention over Configuration - only 3 params required (enabled, host, port)
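
As a rough sketch of setter-level validation of this kind: the class below is hypothetical, and the `localhost`/`6379` defaults and the exact validation rules are assumptions. Only `enabled: false` and the 1800-second TTL default are stated in this description.

```java
// Hypothetical sketch; field names mirror the YAML keys, the exact rules are assumptions.
class ValkeyConfigSketch {
    private boolean enabled = false;       // disabled by default, per this PR
    private String host = "localhost";     // assumed default
    private int port = 6379;               // assumed default
    private long cacheTtlSeconds = 1800;   // 30 minutes, the documented default

    void setPort(int port) {
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port must be in 1..65535, got " + port);
        }
        this.port = port;
    }

    void setCacheTtlSeconds(long cacheTtlSeconds) {
        if (cacheTtlSeconds <= 0) {
            throw new IllegalArgumentException("cacheTtlSeconds must be positive, got " + cacheTtlSeconds);
        }
        this.cacheTtlSeconds = cacheTtlSeconds;
    }

    boolean isEnabled() { return enabled; }
    String getHost() { return host; }
    int getPort() { return port; }
    long getCacheTtlSeconds() { return cacheTtlSeconds; }
}
```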

ValkeyDistributedCache (gateway-ha/src/main/java/io/trino/gateway/ha/router/ValkeyDistributedCache.java)

  • Implements DistributedCache interface
  • JedisPool connection pooling with configurable pool size
  • Modern Duration API (no deprecated methods)
  • Graceful degradation when disabled or unhealthy
  • Periodic health checks (PING) with configurable interval
  • Metrics tracking: hits, misses, writes, errors, hit rate

DistributedCache Interface (gateway-ha/src/main/java/io/trino/gateway/ha/router/DistributedCache.java)

  • Clean abstraction for cache operations (get, set, delete, isHealthy)
  • Enables future alternative implementations
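
For readers who have not opened the file, an interface with these four operations might look roughly like the sketch below. The method names come from this description; the parameter and return types are guesses, and `InMemoryCacheSketch` is a hypothetical stand-in showing how an alternative implementation plugs in.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Assumed shape: method names from the PR description, signatures are guesses.
interface DistributedCacheSketch {
    Optional<String> get(String key);

    void set(String key, String value);

    void delete(String key);

    boolean isHealthy();
}

// Hypothetical in-memory implementation, e.g. for tests that need a working cache.
class InMemoryCacheSketch implements DistributedCacheSketch {
    private final Map<String, String> entries = new ConcurrentHashMap<>();

    @Override
    public Optional<String> get(String key) {
        return Optional.ofNullable(entries.get(key));
    }

    @Override
    public void set(String key, String value) {
        entries.put(key, value);
    }

    @Override
    public void delete(String key) {
        entries.remove(key);
    }

    @Override
    public boolean isHealthy() {
        return true; // nothing remote to fail
    }
}
```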

Integration

BaseRoutingManager - Updated routing logic:

  • Write-through caching for backend and routing group (cache on write)
  • Lazy-loading for external URLs (cache on first read)
  • Cache key documentation with Javadoc
  • Graceful fallback to database when cache unavailable
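
The difference between the two strategies can be shown with a minimal sketch. The class and method names here are hypothetical, and two maps stand in for the distributed cache and the database; this is not the logic in `BaseRoutingManager` itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch contrasting write-through and lazy loading.
class WriteThroughSketch {
    private final Map<String, String> cache = new HashMap<>();    // stands in for the distributed cache
    private final Map<String, String> database = new HashMap<>(); // stands in for the database

    // Write-through: the value is cached at the moment it is persisted.
    void setBackendForQueryId(String queryId, String backendUrl) {
        database.put(queryId, backendUrl);
        cache.put("trino:query:backend:" + queryId, backendUrl);
    }

    // Lazy loading: the value is cached only when someone first reads it.
    Optional<String> findExternalUrl(String queryId, Function<String, Optional<String>> loader) {
        String key = "trino:query:externalurl:" + queryId;
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        Optional<String> loaded = loader.apply(queryId);
        loaded.ifPresent(url -> cache.put(key, url));
        return loaded;
    }

    boolean isCached(String key) {
        return cache.containsKey(key);
    }
}
```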

HaGatewayProviderModule - Dependency injection:

  • Provides DistributedCache singleton
  • Wires ValkeyConfiguration to ValkeyDistributedCache

Configuration

Minimal (Recommended for Getting Started)

```yaml
valkeyConfiguration:
  enabled: true
  host: localhost
  port: 6379
  # password: ${VALKEY_PASSWORD}  # Optional: if AUTH required
  # cacheTtlSeconds: 1800  # Optional: Cache TTL (default: 1800 = 30 minutes)
```

Advanced (Production Tuning)

```yaml
valkeyConfiguration:
  enabled: true
  host: valkey.internal.prod
  port: 6379
  password: ${VALKEY_PASSWORD}
  database: 0
  maxTotal: 100              # More connections for high concurrency
  maxIdle: 50
  minIdle: 25
  timeoutMs: 5000            # Longer timeout for slower networks
  cacheTtlSeconds: 3600      # 1 hour for long-running queries
  healthCheckIntervalMs: 60000  # 1 minute health checks
```

Single Instance (No Changes Required)

```yaml
valkeyConfiguration:
  enabled: false  # Default - local cache sufficient
```

Testing

Unit Tests (31 total, all passing)

TestValkeyConfiguration (16 tests)

  • Default values verification
  • Setter/getter correctness
  • Input validation (port range, positive values, etc.)
  • Edge cases (null password, various database indices)

TestValkeyDistributedCache (15 tests)

  • Disabled cache behavior (returns empty, fails gracefully)
  • Invalid host handling (marks unhealthy)
  • Close operations (safe cleanup)
  • All cache operations (get, set, delete with variations)
  • Null/empty array handling
  • Password and database configuration
  • Pool configuration acceptance

Integration Tests (existing tests updated)

  • All routing manager tests updated with distributed cache
  • NoopDistributedCache for tests not requiring real cache
  • No regression in existing functionality

Documentation

Comprehensive documentation added:

New File: docs/valkey-configuration.md (273 lines)

  • Quick start with minimal config
  • Full configuration reference
  • Deployment scenarios (single vs. multi-instance)
  • Performance tuning guidelines
  • Connection pool sizing recommendations
  • Monitoring and troubleshooting
  • Security checklist
  • Architecture explanation
  • Migration guide
  • FAQ

Updated Files:

  • docs/installation.md - Added "Configure distributed cache" section
  • docs/operation.md - Added "Multi-instance deployments" monitoring section
  • mkdocs.yml - Added navigation entry for Valkey docs

Backward Compatibility

✅ Fully backward compatible

  • Disabled by default (enabled: false)
  • No changes required to existing configs
  • Single-instance deployments work exactly as before
  • Existing tests pass with no changes other than updated constructor calls

Migration Path

From Single to Multi-Gateway

  1. Deploy Valkey server
  2. Update config.yaml on all gateways:
    ```yaml
    valkeyConfiguration:
      enabled: true
      host: valkey.internal
      port: 6379
      password: ${VALKEY_PASSWORD}
    ```
  3. Rolling restart gateways
  4. Monitor cache hit rates

No data migration needed - cache populates automatically.

Graceful Degradation

When Valkey is unavailable:

  • ✅ Queries continue working
  • ✅ Falls back to database lookups
  • ✅ Logs warnings (not errors)
  • ✅ Marks cache as unhealthy
  • ✅ Auto-recovery when Valkey returns
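
One way to get this behavior is a wrapper that turns cache failures into misses and lets a periodic health check restore the healthy flag. The sketch below assumes that approach; `DegradingCacheSketch`, `remoteGet`, and `ping` are hypothetical names, and the real `ValkeyDistributedCache` may be structured differently.

```java
import java.util.Optional;
import java.util.function.BooleanSupplier;
import java.util.function.Function;

// Hypothetical sketch: cache errors become misses, a health check restores service.
class DegradingCacheSketch {
    private final Function<String, String> remoteGet; // stands in for a Valkey GET; may throw
    private final BooleanSupplier ping;               // stands in for a Valkey PING
    private volatile boolean healthy = true;

    DegradingCacheSketch(Function<String, String> remoteGet, BooleanSupplier ping) {
        this.remoteGet = remoteGet;
        this.ping = ping;
    }

    Optional<String> get(String key) {
        if (!healthy) {
            return Optional.empty();   // skip the cache; caller falls back to the database
        }
        try {
            return Optional.ofNullable(remoteGet.apply(key));
        } catch (RuntimeException e) {
            healthy = false;           // a real implementation would log a warning here
            return Optional.empty();
        }
    }

    // Intended to be called on a schedule (for example every healthCheckIntervalMs).
    void healthCheck() {
        try {
            healthy = ping.getAsBoolean();
        } catch (RuntimeException e) {
            healthy = false;
        }
    }

    boolean isHealthy() {
        return healthy;
    }
}
```

In this sketch a failed read marks the cache unhealthy immediately, and reads resume only after the next successful health check.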

Monitoring

Cache metrics available via ValkeyDistributedCache:

  • getCacheHits() - Total cache hits
  • getCacheMisses() - Total cache misses
  • getCacheWrites() - Total successful writes
  • getCacheErrors() - Total operation errors
  • getCacheHitRate() - Hit rate percentage (0-100)
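
The hit rate is hits divided by total lookups, expressed as a percentage. A minimal sketch of thread-safe counters with that calculation follows; the class name is hypothetical, the getter names mirror the list above, and returning 0 when there have been no lookups is an assumption.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical counters; getter names mirror the PR description.
class CacheMetricsSketch {
    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();

    void recordHit() {
        hits.increment();
    }

    void recordMiss() {
        misses.increment();
    }

    long getCacheHits() {
        return hits.sum();
    }

    long getCacheMisses() {
        return misses.sum();
    }

    // Percentage in the range 0-100; 0 when no lookups have happened (assumed convention).
    double getCacheHitRate() {
        long hitCount = hits.sum();
        long total = hitCount + misses.sum();
        return total == 0 ? 0.0 : 100.0 * hitCount / total;
    }
}
```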

Future work: Expose these via /metrics endpoint for Prometheus.

Dependencies

Added: io.valkey:valkey-java:5.5.0

  • Valkey is a Redis fork with a compatible protocol
  • Works with both Valkey and Redis servers
  • Apache 2.0 licensed

Files Changed

New Files (7)

  • gateway-ha/src/main/java/io/trino/gateway/ha/config/ValkeyConfiguration.java
  • gateway-ha/src/main/java/io/trino/gateway/ha/router/DistributedCache.java
  • gateway-ha/src/main/java/io/trino/gateway/ha/router/ValkeyDistributedCache.java
  • gateway-ha/src/test/java/io/trino/gateway/ha/config/TestValkeyConfiguration.java
  • gateway-ha/src/test/java/io/trino/gateway/ha/router/NoopDistributedCache.java
  • gateway-ha/src/test/java/io/trino/gateway/ha/router/TestValkeyDistributedCache.java
  • docs/valkey-configuration.md

Modified Files (16)

  • Configuration: gateway-ha/config.yaml, docs/config.yaml
  • Core: HaGatewayConfiguration.java, HaGatewayProviderModule.java, BaseRoutingManager.java
  • Routing: QueryCountBasedRouter.java, StochasticRoutingManager.java, ProxyRequestHandler.java
  • Tests: 4 routing manager tests updated
  • Docs: installation.md, operation.md, mkdocs.yml

Checklist

  • Code follows project style guidelines
  • No deprecated APIs used (Duration instead of milliseconds)
  • Comprehensive input validation
  • Unit tests written and passing (31 tests)
  • Integration tests updated and passing
  • Documentation complete and accurate
  • Configuration examples provided
  • Backward compatible (disabled by default)
  • Graceful degradation implemented
  • Security considerations documented
  • Performance tuning guidelines included

Future Enhancements

  • Expose cache metrics via /metrics OpenMetrics endpoint
  • Add TLS/SSL support for Valkey connections
  • Support Redis Cluster mode
  • Add cache warming on startup
  • Implement cache eviction strategies
  • Add circuit breaker pattern for cache failures

Testing Instructions

Local Testing (Single Instance)

No changes needed; the gateway works as before:

```bash
java -jar gateway-ha.jar config.yaml
```

Multi-Instance with Valkey

Start Valkey:

```bash
docker run -d -p 6379:6379 valkey/valkey:latest
```

Update config.yaml:

```yaml
valkeyConfiguration:
  enabled: true
  host: localhost
  port: 6379
```

Start multiple gateways:

```bash
java -jar gateway-ha.jar config.yaml
```

Verify Cache Working

This commit adds Valkey (Redis-compatible) distributed cache support to enable
horizontal scaling of Trino Gateway across multiple instances.

Key features:
- 3-tier caching: Guava (local) -> Valkey (distributed) -> PostgreSQL (persistent)
- Graceful degradation if Valkey is unavailable
- Connection pooling with health checks
- Configurable TTL and connection parameters
- Comprehensive documentation

New files:
- ValkeyConfiguration.java: Configuration class
- DistributedCache.java: Cache interface
- ValkeyDistributedCache.java: Valkey implementation
- NoopDistributedCache.java: Test implementation
- docs/valkey-configuration.md: Complete documentation

Modified files:
- Integrated cache into routing managers
- Updated dependency injection
- Added Jedis dependency
- Updated configuration files
- Updated tests